Hi PD team,
- We are getting alerts like the one below, but the message is not useful to the engineer who receives the page because it does not clearly describe what is affected. We have multiple alerts where the Prometheus metrics carry a job or a pod label: when the job label is there, the pod label will also be there, but the reverse is not true. If the metric is for a Deployment, pod will be there but job will not.
We would like the description to be meaningful for both the job and the pod cases, so that the engineer can act on the alert directly (a rough sketch of what we have in mind follows the example alert below).
Below is an alert we are currently getting, where the message/description is not meaningful:
- Labels:
- alertname = KubeJobCompletion
- cluster = pre2
- endpoint = http
- instance = 10.2.45.126:8080
- job = kube-state-metrics
- job_name = vault-backup-1581410820
- namespace = dev-vault
- pod = prometheus-operator-kube-state-metrics-7ccb55bbc7-nssmp
- prometheus = dev-monitoring/dev-prometheus
- service = prometheus-operator-kube-state-metrics
- severity = warning
- Annotations:
- message = Job dev-vault/vault-backup-1581410820 is taking more than one hour to complete.
- runbook_url = https://someurl/
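To illustrate what we mean by a more meaningful description, this is roughly the wording we would like the engineer to see in PagerDuty for the two cases (the second line is a made-up Deployment/pod-only example, just for illustration):

  Job dev-vault/vault-backup-1581410820 is taking more than one hour to complete (cluster pre2).
  Pod dev-example/example-app-5d8f7c9b4d-xk2lp is crash looping (cluster pre2).   # hypothetical pod-only alert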
Below is our current Alertmanager configuration:
global:
  resolve_timeout: 5m
route:
  group_by: [ 'namespace', 'pod', 'job', 'severity', 'instance']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
  receiver: 'dev-teams-test' # Default receiver. Maybe a Slack channel for non-critical/fallback alerts.
  routes:
    - receiver: dev-pager
      match_re:
        ## The default namespace is included because some API alerts specify that namespace.
        ## The trailing '|' is intentional, so we also match alerts that have no namespace label.
        namespace: dev.*|kube.*|istio.*|default|
        severity: critical|error|warning
      continue: true
receivers:
  - name: dev-pager
    pagerduty_configs:
      - send_resolved: true
        routing_key:
        url: https://events.pagerduty.com/v2/enqueue
        severity: '{{ if .CommonLabels.severity }}{{ .CommonLabels.severity | toLower }}{{ else }}critical{{ end }}'
        description: '{{ if .CommonLabels.job }}{{ .CommonLabels.job | toLower }}{{ else }}{{ .CommonLabels.pod | toLower }}{{ end }}'
In the configuration above I have tried to add the description field. Is it valid? If not, could you please suggest an alternative or share an example?
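For context, this is the direction we were thinking of (just a rough guess on our side, not tested): fall back from the Kubernetes Job name to the pod, and include the namespace and the message annotation. We are assuming job_name is the right label for Jobs and that these labels/annotations are still common across the grouped alerts, so please correct us if that is wrong:

        description: '{{ if .CommonLabels.job_name }}Job {{ .CommonLabels.namespace }}/{{ .CommonLabels.job_name }}: {{ .CommonAnnotations.message }}{{ else }}Pod {{ .CommonLabels.namespace }}/{{ .CommonLabels.pod }}: {{ .CommonAnnotations.message }}{{ end }}'

Would something like this be valid, or is there a better way to build the description?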
- The second issue is with the alert subject/name that we are getting:
Alert subject/Name : [FIRING:6] 10.2.45.126:8080 kube-state-metrics dev-vault warning (pre2 http prometheus-operator-kube-state-metrics-7ccb55bbc7-nssmp dev-monitoring/dev-prometheus prometheus-operator-kube-state-metrics)
Here, in the common labels, we are getting unwanted data such as http (which comes from the endpoint label).
How can we remove this http value from the subject?
Below is the alertmanager template entry which is adding the common labels:
{{ define "__subject" }}[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .GroupLabels.SortedPairs.Values | join " " }} {{ if gt (len .CommonLabels) (len .GroupLabels) }}({{ with .CommonLabels.Remove .GroupLabels.Names }}{{ .Values | join " " }}{{ end }}){{ end }}{{ end }}
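One idea we had for getting rid of the http value (again, just a guess, please correct us): since it comes from the endpoint label, we could drop that label from all alerts before they reach Alertmanager, via alert_relabel_configs in prometheus.yml, so it never appears in the common labels at all (in our setup this would presumably have to be applied through the prometheus-operator configuration):

    alerting:
      alert_relabel_configs:
        - action: labeldrop
          regex: endpoint

Or is it better to strip specific labels like endpoint inside the __subject template itself?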
Kindly suggest on both issues.